Let’s continue working with the horseshoe crab data from last time. Today, we’ll fit three models under different assumptions, evaluating their usefulness.

suppressPackageStartupMessages(library(tidyverse))
crab <- read_table("https://newonlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson07/crab/index.txt", col_names = FALSE) %>% 
  select(-1) %>% 
  setNames(c("colour","spine","width","weight","n_male")) %>% 
  mutate(colour = factor(colour),
         spine  = factor(spine))
Parsed with column specification:
cols(
  X1 = col_integer(),
  X2 = col_integer(),
  X3 = col_integer(),
  X4 = col_double(),
  X5 = col_double(),
  X6 = col_integer()
)
knitr::kable(head(crab))
colour spine width weight n_male
2 3 28.3 3.05 8
3 3 26.0 2.60 4
3 3 25.6 2.15 0
4 2 21.0 1.85 0
2 3 29.0 3.00 1
1 2 25.0 2.30 3

Case study: how do features of nesting female horseshoe crabs influence the number of males found nearby? Let’s see how carapace width influences the mean number of males nearby.

p <- ggplot(crab, aes(width, n_male)) + 
  geom_point(alpha=0.25) +
  labs(x = "Carapace Width", 
       y = "No. males\nnearby") +
  theme_bw() +
  theme(axis.title.y = element_text(angle=0, vjust=0.5))
plotly::ggplotly(p)

Data source: H. Jane Brockmann’s 1996 paper; found online here; another regression demo with this data is found here.

Approach 1: Estimate regression curve / model function locally

Optimize the loess fit by eye. Just modify span, to keep things simple.

grid <- seq(min(crab$width), max(crab$width), length.out=100)
grid_df <- tibble(width = grid)
model1 <- loess(n_male ~ width, data=crab, degree=2, span=0.5)
grid_df %>% 
  mutate(., yhat = predict(model1, .)) %>% #. specifying new data i.e grid_df
  ggplot(aes(width, yhat)) +
  geom_line(colour="blue") +
  geom_point(data=crab, mapping=aes(width, n_male), alpha=0.25)

What’s the error of this model? Training error is fine.

resid1 <- crab$n_male - predict(model1)
(error1 <- mean(resid1^2))
[1] 8.64852

How well does this model answer our original question?

Approach 2: Linear Regression

Fit a linear regression model

Fit a linear regression model. What’s the error?

model2 <- lm(n_male ~ width, data=crab)
ggplot(crab, aes(width, n_male)) +
    geom_point(alpha=0.25) +
    geom_smooth(method="lm", se=FALSE)

resid2 <- crab$n_male - predict(model2)
(error2 <- mean(resid2^2))
[1] 8.716252

How well does this model answer our original question? Do you see a potential problem with this model? Are any assumptions of linear regression not true? Brainstorm ideas for how to deal with the problems.

can predict negative values. Problem if predicts negative values for reasonable widths.

How can we prevent this w/o losing interpretability?

New approaches

Let’s talk about an alternative approach called Generalized Linear Models (GLM). Topics:

Approach 4: Scientifically motivated model function?

There’s probably none here, but if there was, what then?

LS0tCnRpdGxlOiAiSG9yZXNob2UgY3JhYiBjYXNlIHN0dWR5OiBtb2RlbHMgdW5kZXIgdmFyaW91cyBhc3N1bXB0aW9ucyIKb3V0cHV0OiBodG1sX25vdGVib29rCmVkaXRvcl9vcHRpb25zOiAKICBjaHVua19vdXRwdXRfdHlwZTogaW5saW5lCi0tLQoKTGV0J3MgY29udGludWUgd29ya2luZyB3aXRoIHRoZSBob3JzZXNob2UgY3JhYiBkYXRhIGZyb20gbGFzdCB0aW1lLiBUb2RheSwgd2UnbGwgZml0IHRocmVlIG1vZGVscyB1bmRlciBkaWZmZXJlbnQgYXNzdW1wdGlvbnMsIGV2YWx1YXRpbmcgdGhlaXIgdXNlZnVsbmVzcy4KCmBgYHtyfQpzdXBwcmVzc1BhY2thZ2VTdGFydHVwTWVzc2FnZXMobGlicmFyeSh0aWR5dmVyc2UpKQpjcmFiIDwtIHJlYWRfdGFibGUoImh0dHBzOi8vbmV3b25saW5lY291cnNlcy5zY2llbmNlLnBzdS5lZHUvc3RhdDUwNC9zaXRlcy9vbmxpbmVjb3Vyc2VzLnNjaWVuY2UucHN1LmVkdS5zdGF0NTA0L2ZpbGVzL2xlc3NvbjA3L2NyYWIvaW5kZXgudHh0IiwgY29sX25hbWVzID0gRkFMU0UpICU+JSAKICBzZWxlY3QoLTEpICU+JSAKICBzZXROYW1lcyhjKCJjb2xvdXIiLCJzcGluZSIsIndpZHRoIiwid2VpZ2h0Iiwibl9tYWxlIikpICU+JSAKICBtdXRhdGUoY29sb3VyID0gZmFjdG9yKGNvbG91ciksCiAgICAgICAgIHNwaW5lICA9IGZhY3RvcihzcGluZSkpCmtuaXRyOjprYWJsZShoZWFkKGNyYWIpKQpgYGAKCkNhc2Ugc3R1ZHk6IGhvdyBkbyBmZWF0dXJlcyBvZiBuZXN0aW5nIGZlbWFsZSBob3JzZXNob2UgY3JhYnMgaW5mbHVlbmNlIHRoZSBudW1iZXIgb2YgbWFsZXMgZm91bmQgbmVhcmJ5PyBMZXQncyBzZWUgaG93IGNhcmFwYWNlIHdpZHRoIGluZmx1ZW5jZXMgdGhlIG1lYW4gbnVtYmVyIG9mIG1hbGVzIG5lYXJieS4KCmBgYHtyLCBmaWcud2lkdGg9NiwgZmlnLmhlaWdodD0zfQpwIDwtIGdncGxvdChjcmFiLCBhZXMod2lkdGgsIG5fbWFsZSkpICsgCiAgZ2VvbV9wb2ludChhbHBoYT0wLjI1KSArCiAgbGFicyh4ID0gIkNhcmFwYWNlIFdpZHRoIiwgCiAgICAgICB5ID0gIk5vLiBtYWxlc1xubmVhcmJ5IikgKwogIHRoZW1lX2J3KCkgKwogIHRoZW1lKGF4aXMudGl0bGUueSA9IGVsZW1lbnRfdGV4dChhbmdsZT0wLCB2anVzdD0wLjUpKQpwbG90bHk6OmdncGxvdGx5KHApCmBgYAoKRGF0YSBzb3VyY2U6IFtILiBKYW5lIEJyb2NrbWFubidzIDE5OTYgcGFwZXJdKGh0dHBzOi8vb25saW5lbGlicmFyeS53aWxleS5jb20vZG9pL2Ficy8xMC4xMTExL2ouMTQzOS0wMzEwLjE5OTYudGIwMTA5OS54KTsgZm91bmQgb25saW5lIFtoZXJlXShodHRwczovL25ld29ubGluZWNvdXJzZXMuc2NpZW5jZS5wc3UuZWR1L3N0YXQ1MDQvc2l0ZXMvb25saW5lY291cnNlcy5zY2llbmNlLnBzdS5lZHUuc3RhdDUwNC9maWxlcy9sZXNzb24wNy9jcmFiL2luZGV4LnR4dCk7IGFub3RoZXIgcmVncmVzc2lvbiBkZW1vIHdpdGggdGhpcyBkYXRhIGlzIGZvdW5kIFtoZXJlXShodHRwczovL25ld29ubGluZWNvdXJzZXMuc2NpZW5jZS5wc3UuZWR1L3N0YXQ1MDQvbm9kZS8xNjkvKS4KCgojIyBBcHByb2FjaCAxOiBFc3RpbWF0ZSByZWdyZXNzaW9uIGN1cnZlIC8gbW9kZWwgZnVuY3Rpb24gbG9jYWxseQoKT3B0aW1pemUgdGhlIGxvZXNzIGZpdCBieSBleWUuIEp1c3QgbW9kaWZ5IHNwYW4sIHRvIGtlZXAgdGhpbmdzIHNpbXBsZS4KCmBgYHtyLCBmaWcud2lkdGg9NiwgZmlnLmhlaWdodD00fQpncmlkIDwtIHNlcShtaW4oY3JhYiR3aWR0aCksIG1heChjcmFiJHdpZHRoKSwgbGVuZ3RoLm91dD0xMDApCmdyaWRfZGYgPC0gdGliYmxlKHdpZHRoID0gZ3JpZCkKbW9kZWwxIDwtIGxvZXNzKG5fbWFsZSB+IHdpZHRoLCBkYXRhPWNyYWIsIGRlZ3JlZT0yLCBzcGFuPTAuNSkKZ3JpZF9kZiAlPiUgCiAgbXV0YXRlKC4sIHloYXQgPSBwcmVkaWN0KG1vZGVsMSwgLikpICU+JSAjLiBzcGVjaWZ5aW5nIG5ldyBkYXRhIGkuZSBncmlkX2RmCiAgZ2dwbG90KGFlcyh3aWR0aCwgeWhhdCkpICsKICBnZW9tX2xpbmUoY29sb3VyPSJibHVlIikgKwogIGdlb21fcG9pbnQoZGF0YT1jcmFiLCBtYXBwaW5nPWFlcyh3aWR0aCwgbl9tYWxlKSwgYWxwaGE9MC4yNSkKYGBgCgpXaGF0J3MgdGhlIGVycm9yIG9mIHRoaXMgbW9kZWw/IFRyYWluaW5nIGVycm9yIGlzIGZpbmUuCgpgYGB7cn0KcmVzaWQxIDwtIGNyYWIkbl9tYWxlIC0gcHJlZGljdChtb2RlbDEpCihlcnJvcjEgPC0gbWVhbihyZXNpZDFeMikpCmBgYAoKSG93IHdlbGwgZG9lcyB0aGlzIG1vZGVsIGFuc3dlciBvdXIgb3JpZ2luYWwgcXVlc3Rpb24/CgojIyBBcHByb2FjaCAyOiBMaW5lYXIgUmVncmVzc2lvbgoKIyMjIEZpdCBhIGxpbmVhciByZWdyZXNzaW9uIG1vZGVsCgpGaXQgYSBsaW5lYXIgcmVncmVzc2lvbiBtb2RlbC4gV2hhdCdzIHRoZSBlcnJvcj8KCmBgYHtyfQptb2RlbDIgPC0gbG0obl9tYWxlIH4gd2lkdGgsIGRhdGE9Y3JhYikKZ2dwbG90KGNyYWIsIGFlcyh3aWR0aCwgbl9tYWxlKSkgKwogICAgZ2VvbV9wb2ludChhbHBoYT0wLjI1KSArCiAgICBnZW9tX3Ntb290aChtZXRob2Q9ImxtIiwgc2U9RkFMU0UpCnJlc2lkMiA8LSBjcmFiJG5fbWFsZSAtIHByZWRpY3QobW9kZWwyKQooZXJyb3IyIDwtIG1lYW4ocmVzaWQyXjIpKQpgYGAKCkhvdyB3ZWxsIGRvZXMgdGhpcyBtb2RlbCBhbnN3ZXIgb3VyIG9yaWdpbmFsIHF1ZXN0aW9uPyBEbyB5b3Ugc2VlIGEgcG90ZW50aWFsIHByb2JsZW0gd2l0aCB0aGlzIG1vZGVsPyBBcmUgYW55IGFzc3VtcHRpb25zIG9mIGxpbmVhciByZWdyZXNzaW9uIG5vdCB0cnVlPyBCcmFpbnN0b3JtIGlkZWFzIGZvciBob3cgdG8gZGVhbCB3aXRoIHRoZSBwcm9ibGVtcy4KCmNhbiBwcmVkaWN0IG5lZ2F0aXZlIHZhbHVlcy4gUHJvYmxlbSBpZiBwcmVkaWN0cyBuZWdhdGl2ZSB2YWx1ZXMgZm9yIHJlYXNvbmFibGUgd2lkdGhzLgoKSG93IGNhbiB3ZSBwcmV2ZW50IHRoaXMgdy9vIGxvc2luZyBpbnRlcnByZXRhYmlsaXR5PwoKCiMjIE5ldyBhcHByb2FjaGVzCgpMZXQncyB0YWxrIGFib3V0IGFuIGFsdGVybmF0aXZlIGFwcHJvYWNoIGNhbGxlZCBfR2VuZXJhbGl6ZWQgTGluZWFyIE1vZGVsc18gKEdMTSkuIFRvcGljczoKCi0gQXBwcm9wcmlhdGUgdHJhbnNmb3JtYXRpb246IG9uIFk/IE9uIEUoWXxYKT8gTGluayBmdW5jdGlvbiBkZWZpbml0aW9ucy4KLSBJbnRlcnByZXRhdGlvbiBvZiBwYXJhbWV0ZXJzIHVuZGVyIGxvZyBhbmQgbG9naXQgbGlua3MuCi0gRml0dGluZyB0aGUgbW9kZWwgZnVuY3Rpb246IElzIExTICJ2YWxpZCI/IENhbiB3ZSBkbyBiZXR0ZXI/IAogICAgLSBUaGUgdHdvIHR5cGVzIG9mIHBhcmFtZXRyaWMgYXNzdW1wdGlvbnMsIHRoZWlyIHJpc2ssIGFuZCB0aGVpciB2YWx1ZS4KICAgIC0gUmV2aWV3IG9mIE1MRT8KICAgIC0gTm9tZW5jbGF0dXJlOiBQb2lzc29uIHJlZ3Jlc3Npb247IEJpbm9taWFsL0Jlcm5vdWxsaS9Mb2dpc3RpYyByZWdyZXNzaW9uLgoKIyMgQXBwcm9hY2ggMzogTGluayBGdW5jdGlvbgoKRml0IGEgR0xNLiBQbG90IHVzaW5nIGBnZ3Bsb3QyYCwgbWFraW5nIHVzZSBvZiB0aGUgYG1ldGhvZC5hcmdzYCBhcmd1bWVudCBvZiBgZ2VvbV9zbW9vdGgoKWAuIFdoYXQncyB0aGUgZXJyb3I/CgpgYGB7cn0KbW9kZWwzIDwtIGdsbShuX21hbGUgfiB3aWR0aCwgZGF0YT1jcmFiLCBmYW1pbHkgPSBwb2lzc29uKQpnZ3Bsb3QoY3JhYiwgYWVzKHdpZHRoLCBuX21hbGUpKSArCiAgICBnZW9tX3BvaW50KGFscGhhPTAuMjUpICsKICAgIGdlb21fc21vb3RoKG1ldGhvZD0iZ2xtIiwgc2U9RkFMU0UsIEZJTExfVEhJU19JTikKcmVzaWQzIDwtIGNyYWIkbl9tYWxlIC0gcHJlZGljdChtb2RlbDMsIHR5cGUgPSAicmVzcG9uc2UiKSAjIG5lZWQgdHlwZSBvciBlbHNlIHdpbGwgcHJlZGljdCBsaW5lYXIgZnVuY3Rpb24KKGVycm9yMyA8LSBtZWFuKHJlc2lkM14yKSkKYGBgCgpCb251czogZ2VuZXJhdGUgc29tZSBkYXRhIHVuZGVyIHRoZSBmaXR0ZWQgbW9kZWwuIEhvdyBkb2VzIGl0IGNvbXBhcmU/CgojIyBBcHByb2FjaCA0OiBTY2llbnRpZmljYWxseSBtb3RpdmF0ZWQgbW9kZWwgZnVuY3Rpb24/CgpUaGVyZSdzIHByb2JhYmx5IG5vbmUgaGVyZSwgYnV0IGlmIHRoZXJlIHdhcywgd2hhdCB0aGVuPwo=